Time - 5 minutes
To briefly recap, last week we covered:
One criticism of R is that it is (relatively - try doing the sums by hand instead!) slow. R is a high level programming language (i.e. it has a bunch of predefined functions and automatic data handling etc etc) which makes it a great language to use and learn, but can (sometimes) make it a little slow when compared to lower level languages (like C++). The creators of R have gone a long way to try and solve these issues, and indeed a lot of the R functions you use will actually call C code in the background (and indeed you can write C code in R if you really wanted to, for example if you were writing a new package for fast data analysis).
However, whilst a lot of functions have been optimised to run fast, in reality the most likely reason your code is slow (and might take ages to run) is not-best-practice coding on your part. So this workshop will be on writing fast R code, and - where necessary - parallelising that code.
Before you start to optimise your code consider: do you actually need to?
This might sound silly, but if it takes you 3 hours to optimise your code and saves you 3 minutes of run-time then there isn't much point in doing it! Consider optimising code when you are having to run the same piece of code over and over again, and thus where small savings in the speed of that code can make a big difference in the total run time.
Alternatively, just get a bigger/faster computer to run your code on!
That being said, here are some nifty methods for speeding up your R code.
We are going to look at some easy ways to speed up your code. Hopefully this will be a nice break from all the statistics!
Time - 10 minutes
First off, we need to find out how fast our code is. R has a few different ways of timing code. There is a generic base function, system.time(), which allows you to time expressions:
##long vector
x <- c(1:1e6)
##time a calculation
system.time(sin(x + x^2))
## user system elapsed
## 0.019 0.001 0.020
The “elapsed” value gives us the time in seconds it took to run the code.
Note - we didn't have the values of our calculation returned to us! There is a way around that…
If we want to include more complex code we can wrap the code in { } and put it into system.time():
##time a calculation
system.time({
##long vector
x <- c(1:1e6)
##calculate the sin of the following
v <- sin(x + x^2)
##put it into a tibble
data_out <- as_tibble(v)
})
## user system elapsed
## 0.026 0.003 0.030
##look at the data
data_out
## # A tibble: 1,000,000 × 1
## value
## <dbl>
## 1 0.909
## 2 -0.279
## 3 -0.537
## 4 0.913
## 5 -0.988
## 6 -0.917
## 7 -0.522
## 8 0.254
## 9 0.894
## 10 -0.0442
## # … with 999,990 more rows
Now you can see we have saved the various components as objects, and can access them as normal.
An alternative (and more powerful) approach is via the microbenchmark library. This allows you to time one expression, or compare two or more expressions against one another. It does this multiple times (as there are always slight differences in how long even the same chunk of code takes to run each time). As above you can include more complex code in the { }, but it is often easier to save the code as a function and run that:
##load the microbenchmark package
library(microbenchmark)
##long vector
x <- c(1:1e6)
##approach 1
f_1 <- function(x){
##calculate the sin of the following
v <- sin(x + x^2)
##put it into a tibble
data_out <- as_tibble(v)
return(data_out)
}
##approach 2
f_2 <- function(x){
##calculate the sin of the following
##without saving objects v and data_out
return(as_tibble(sin(x + x^2)))
}
##compare the speeds of the two functions,
##running each 1000 times
##note we must call the functions, i.e. f_1(x),
##not just name them, or we only time the symbol lookup
benchmarks <- microbenchmark(f_1(x),
f_2(x),
times = 1000,
unit = "s")
##look at the results
benchmarks
##Unit: seconds
## expr min lq mean median uq max neval
## f_1 4.0e-08 4.2e-08 4.3639e-08 4.3e-08 4.4e-08 4.13e-07 1000
## f_2 3.3e-08 3.6e-08 3.8964e-08 3.8e-08 4.0e-08 8.85e-07 1000
So we can see that f_2 is about 10% faster than f_1 (its mean and median are both lower), so if we were to repeat this calculation hundreds of times it would be sensible to use the f_2 function.
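As a side note, microbenchmark objects come with handy summary and plotting methods; a quick sketch (assuming the microbenchmark package is installed - the two expressions compared here are just illustrative):

```r
##load the microbenchmark package
library(microbenchmark)
##benchmark two ways of squaring a vector
bm <- microbenchmark(power = (1:1000)^2,
                     multiply = (1:1000) * (1:1000),
                     times = 100)
##tabular summary (min, lq, mean, median, uq, max per expression)
summary(bm)
##boxplot comparison of the timing distributions
boxplot(bm)
```

Naming the expressions (power =, multiply =) makes the output tables and plots much easier to read.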
Time - 15 minutes
The above example neatly highlights an important way we can simply and easily speed up our R code - by minimising the number of steps R takes to produce our results. This should be obvious, but is worth pointing out. In our first function:
##approach 1
f_1 <- function(x){
##calculate the sin of the following
v <- sin(x + x^2)
##put it into a tibble
data_out <- as_tibble(v)
return(data_out)
}
We take the vector x, calculate its sin() and save that as v, and then use v to make a tibble. This code works fine, but is slower than the second function:
##approach 2
f_2 <- function(x){
##calculate the sin of the following
##without saving objects v and data_out
return(as_tibble(sin(x + x^2)))
}
In this second function we take far fewer steps and create fewer objects (so this will be more memory and processor efficient). However, this comes at a cost - readability. For a computer, both code chunks are equally easy to read, but for a human the first (with its annotation and separation of calculations) is far easier than the second. This is worth considering when you are writing your scripts - is being more readable of greater importance than being slightly faster?
Task
Write the following chunks of code in a more concise way:
############################################
## example 1
############################################
## vector of values
x <- c(1, 2, 3, 4, 5, 6, 7)
sample(x, 2)
############################################
## example 2
############################################
data("ToothGrowth")
sub1 <- filter(ToothGrowth, len>10)
sub2 <- filter(sub1, supp=="VC")
ggplot(sub2, aes(x=factor(dose), y=len))+geom_boxplot()+theme_minimal()
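One possible set of more concise answers (a sketch - other equally valid solutions exist; this assumes the dplyr and ggplot2 packages are loaded):

```r
##load the packages used below
library(dplyr)
library(ggplot2)
##example 1: sample() treats a single number n as the vector 1:n,
##so we don't need to build x first
sample(7, 2)
##example 2: combine the two filters into a single call
##and pipe straight into ggplot
data("ToothGrowth")
ToothGrowth %>%
  filter(len > 10, supp == "VC") %>%
  ggplot(aes(x = factor(dose), y = len)) +
  geom_boxplot() +
  theme_minimal()
```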
Time - 30 minutes
The good news is that a lot of R functions are already vectorised for you. “But wait” I hear you say, what is vectorising?
Well, the easiest way to show this is with an example. Say we want to add the following two vectors together:
3 4 5 2
1 2 3 4
If we were to do this in a non-vectorised way, say by hand, then we would do something like the following:
3 + 1 = 4
4 + 2 = 6
5 + 3 = 8
2 + 4 = 6
To give our new vector of:
4 6 8 6
This isn’t what R does (thankfully) as this would be, as you can imagine, very slow. Instead R can do the following:
vec_1 <- c(3, 4, 5, 2)
vec_2 <- c(1, 2, 3, 4)
vec_1 + vec_2
## [1] 4 6 8 6
Looks the same, but what we have actually done here is this:
3 4 5 2
+ + + +
1 2 3 4
i.e. we have added the first vector to the second vector as single objects, rather than doing the sums independently for each of the numbers within the vectors. This might not sound like a big deal, but we are adding together very small vectors here (4 numbers in each) and the difference between the two approaches becomes very obvious when we start adding together bigger vectors.
As we said, many R functions are already vectorised. These include the arithmetic operators (+, -, *, /, ^), the comparison and logical operators (>, ==, &, |), and most mathematical functions (sin(), log(), sqrt(), exp() etc.).
Let’s demonstrate this by writing a function to do pair-wise addition and compare it to a vectorised approach. We are going to use the for() function to write a loop to make this happen. Loops can be very useful ways of iterating through data. The for() function takes two arguments: the first makes a new object, which here we have called i (the name doesn't matter, it could be "custard"); the second should be a vector of the length we want to iterate along. for() will then repeat the code inside the {} brackets until the specified number of iterations has been completed, and at each iteration i will become the next value in the vector. The easiest way to understand this is to see it in action:
## a vector of values
x <- 1:5
##loop along the length of x
for(i in 1:length(x)){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
You can see that above we have made a vector, and the for() loop has run 5 times (specified by the 1:length(x)), and at each iteration of the loop it has used a new value for i taken from the vector 1:length(x). We can then use these loops for a multitude of different purposes.
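A small defensive aside (not something the loop above strictly needs): seq_along(x) is a safer way to write 1:length(x), because if x happens to be empty then 1:length(x) becomes 1:0 (i.e. the vector c(1, 0)) and the loop still runs twice:

```r
##an empty vector
x <- numeric(0)
##1:length(x) is c(1, 0), so this loop runs twice - probably a bug!
for(i in 1:length(x)){
  print(i)
}
##seq_along(x) is a zero-length vector, so this loop
##(correctly) never runs
for(i in seq_along(x)){
  print(i)
}
```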
Going back to our original question (vectorised vs non-vectorised code), we will use the for() loop to iterate along the two vectors, selecting the i-th value in each vector and adding them together:
##non-vectorised addition function
non_vec_addition <- function(x, y){
##define an object to store the values in
z <- numeric(length(x))
##use a loop to add each element of the vectors together in turn
##and then save the output into the correct slot in the z vector
for(i in 1:length(x)){
z[i] <- x[i] + y[i]
}
##return the result
return(z)
}
We can then use this function to do non-vectorised addition of two vectors:
##addition of two vectors
non_vec_addition(vec_1, vec_2)
## [1] 4 6 8 6
To demonstrate how powerful vectorisation is, let's apply our new function to some bigger vectors, time how long it takes to complete, and compare that to the vectorised version:
##two long vectors
long_vec_1 <- runif(10000000, 0, 1)
long_vec_2 <- runif(10000000, 0, 1)
##time the non-vectorised vs vectorised approaches
system.time({
non_vec_addition(long_vec_1, long_vec_2)
})
## user system elapsed
## 0.879 0.017 0.896
system.time({
long_vec_1+long_vec_2
})
## user system elapsed
## 0.005 0.000 0.005
system.time() is a really useful function for finding out how fast bits of your code are, and for testing whether you can speed them up. Here you can see that the non-vectorised version takes 0.896 seconds, whilst the vectorised version takes a blistering 0.005 seconds (as before, the {} lets the timed expression span multiple lines, as long as you close the {} and then close the function with a closing bracket). Clearly these are both small numbers, but this all adds up when you are doing large calculations, especially if you have to do those calculations hundreds of times!
Well, that's a fair question to ask. In practice many of the functions you use will already be vectorised, so you don't have to think about it. However, sometimes we might (for convenience, or because we haven't thought of a better way) use something like the following to process data:
##make sure you load in the data.table library to access rbindlist()
##make a blank list
save_data <- vector(mode = "list",
length = length(unique(iris$Species)))
##add some additional data to the iris data frame
for(o in 1:length(unique(iris$Species))){
##subset the full data set, making a new subset for each species in turn
sub <- subset(iris, Species==unique(iris$Species)[o])
##add in a new column called "outlier"
##we will use this to identify possible outliers in the data
## we initially fill this with NAs
sub$outlier <- NA
##then if the subset data is from the species "versicolor"...
if(sub$Species[1]=="versicolor"){
##...and the value of the Petal.Length column is greater than 5
## then save an "outlier" note in the outlier column
sub$outlier[which(sub$Petal.Length > 5)] <- "outlier"
}
##add the data to our blank list
save_data[[o]] <- sub
}
##bind the list together (using rbindlist from the data.table package)
save_data <- rbindlist(save_data)
##see if it worked:
filter(save_data, Species == "versicolor", Petal.Length > 5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species outlier
## 1: 6 2.7 5.1 1.6 versicolor outlier
Ok, so clearly this code does WORK. But in so many ways this isn't "good" code - it's slow, it's hard to read, and because it's so complex there is a higher chance of you making a coding error.
There is a better way to do this!
Task
Use a vectorised approach to speed up this process, and time how long it takes when compared to the above (hint - logical operators are vectorised!).
Ok, so let's have a think about what we were trying to do with the original code. In essence we are applying a logical test, so we can do the above with much simpler code, and make it much faster:
## make a new column
iris$outlier <- NA
##using logical operators
iris$outlier[iris$Petal.Length>5 & iris$Species=="versicolor"] <- "outlier"
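If you prefer the tidyverse style, the same vectorised logic can be written with mutate() and if_else() (a sketch, assuming the dplyr package is loaded; NA_character_ is used because if_else() requires both branches to be the same type):

```r
##load dplyr
library(dplyr)
##load the iris data
data(iris)
##if_else() is vectorised, so this flags outliers
##in one pass with no loop
iris <- iris %>%
  mutate(outlier = if_else(Petal.Length > 5 & Species == "versicolor",
                           "outlier",
                           NA_character_))
##check it worked
filter(iris, outlier == "outlier")
```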
So in terms of practical applications to your code - try and use vectorised functions wherever possible, and try and simplify your code as much as possible (both for ease of reading and for speed).
When you define an object using the <- command we are allocating memory to a specific task:
x <- runif(10, 0, 1)
What do you think is wrong with doing the following:
for(h in 1:length(x)){
##calculate the sum of the standard deviation of x
sd_x <- sum(sd(x))
##define a constant
cons <- 3.1512
##add to the h'th item in x:
x[h] <- x[h] + sd_x + cons
}
Task
Write a more efficent version of the above.
Again, our above code works fine, but every time the loop runs we are defining sd_x and cons again, even though both of these can be defined outside of the loop. This takes extra time and is memory inefficient, so define them before the loop (or function, etc.):
##calculate the sum of the standard deviation of x
sd_x <- sum(sd(x))
##define a constant
cons <- 3.1512
for(h in 1:length(x)){
##add to the h'th item in x:
x[h] <- x[h] + sd_x + cons
}
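In fact, because + is vectorised, the whole loop can be replaced with a single line that adds sd_x and cons to every element of x at once:

```r
##a vector of values
x <- runif(10, 0, 1)
##calculate the sum of the standard deviation of x
sd_x <- sum(sd(x))
##define a constant
cons <- 3.1512
##vectorised addition - no loop needed at all
x <- x + sd_x + cons
```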
Time - 10 minutes
Similarly to above, avoid defining functions inside other functions (note - you will need to define my_second_func first, as it is used inside my_func, and R reads from top to bottom - like you do!). I.e. don't do this:
##define a function first
my_func <- function(x, y){
##define a second function
my_second_func <- function(x){
return(x^3)
}
##run the new function
new_x <- my_second_func(x)
##return the new values
return(list(new_x, y))
}
Do this:
##define a second function
my_second_func <- function(x){
##return a new value for x
return(x^3)
}
##define a function first
my_func <- function(x, y){
##run the new function
new_x <- my_second_func(x)
##return the new values
return(list(new_x, y))
}
Time - 5 minutes
Unlike some other coding languages you don't need to preallocate memory in R, however doing so can significantly speed up your code. For example:
##a blank object
results <- NULL
##a sequence to loop along
loop_nums <- 1:2000000
##time the loop
system.time({
##run a loop
for(j in loop_nums){
##do a calculation
calc <- sin(j + (j^2))
##bind out the results
results[[j]] <- calc
}
})
## user system elapsed
## 1.600 0.066 1.667
Now the pre-allocated version:
##a sequence to loop along
loop_nums <- 1:2000000
##a blank list of length loop nums
better_results <- vector(mode = "list",
length = length(loop_nums))
##time the loop
system.time({
for(j in loop_nums){
##do a calculation
calc <- sin(j + (j^2))
##bind out the results
better_results[[j]] <- calc
}
})
## user system elapsed
## 0.304 0.018 0.323
That's ~5 times faster (1.667 vs 0.323 seconds)! Magic.
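And if the body of the loop is itself vectorisable, as here, you can often skip the loop entirely - sin() and ^ both operate on whole vectors at once, which is faster still:

```r
##a sequence to calculate over
loop_nums <- 1:2000000
##time the fully vectorised version - no loop, no preallocation needed
system.time({
  vec_results <- sin(loop_nums + loop_nums^2)
})
```

Note this returns a numeric vector rather than a list, which is usually what you want anyway.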
Time - 70 minutes
When the building blocks of R were developed, computers were typically single cored; however most mobile phones now have 6-8 cores, allowing operations to be carried out in parallel to speed up the processing of complex operations.
When we run R as we have done for the last 5 weeks we have been using a single core (processor). However, there are a number of ways to parallelise your code to allow it to run significantly faster (approximately time/number_of_cores). I find this particularly useful for running simulations, as I will demonstrate first, but a number of packages employ parallelised versions of single core functions to speed up things like complex model selection (harking back to our GLM/M workbook from last week). We will go into that too below.
First off let's build a simple two species Lotka-Volterra (LV) model in R. Hopefully you all already know the LV model. This is an example of a system of differential equations. We are going to use a two species version of this:
\[\displaystyle\frac{dR}{dt} = rR - \alpha N R\] \[\displaystyle\frac{dN}{dt} = f \alpha NR - qN\] Where R is the number of the basal prey species and N is the number of primary consumers. The basal species R grows at rate r, alpha is the search efficiency/attack rate of the primary consumers, f is the conversion efficiency of R to new N, and q is the exponential decline rate of N in the absence of prey (R).
Now let's code this in R:
## load the deSolve package for solving
## differential equations
library(deSolve)
## define a two species LV model
lv2 <- function(t, start, parms) {
##set the start values for each of the species
##basal species
R <- start["R"]
## consumers
N <- start["N"]
##allow R to look within parms for the values of r, a, e, etc
with(as.list(parms), {
## dynamics of the resources
dR <- r*R - a*N*R
## dynamics of the primary consumers
dN <- f*a*N*R - q*N
##return a list of the abundances of each species
##deSolve will then group these into a data frame
list(c(dR, dN))
})
}
##we will pass these two functions to the ODE solver
##to tell it to set the abundance of either species to 0
##if the simulated abundance falls below 1 individual
##i.e. we will simulate extinction of the species; otherwise
##we will have some abundance values which are very very small
##and very very large
eventFun<-function(t,y,p){
y <- c(0, 0)
return (y)
}
rootfun <- function (t,y,p) {
if(min(y)<1){
y <- c(0, 0)
}
return(y)
}
##we will then put the above function into a wrapper function which includes
##the ODE solver function and some data reshaping and saving out parameter combinations:
lv2_sim <- function(start, parms, time){
##run the simulation
sim <- as_tibble(lsoda(y = start,
times = time,
func = lv2,
parms = parms,
##pass the event function
events = list(func = eventFun,
##yes we want to use a root
##function
root = TRUE,
##tell lsoda to stop the
##simulation if the event
##is triggered
terminalroot = 1),
##and the root function
rootfun = rootfun))
##reshape the data to long format:
longer_sim <- sim %>% pivot_longer(-time,
values_to = "abundance",
names_to="species")
##number of years the time series ran for:
longer_sim$sim_length<-max(longer_sim$time)
##add a column to allow us to easily split the results up later
longer_sim$parms <- paste(names(parms), parms, sep="=", collapse = ",")
##make a list of the simulation results and the
##parameters which made them
res <- list("sim" = longer_sim,
"parameters" = parms)
return(res)
}
##make an object of the parameters of the model
##we will pass this to the ODE solver
parms <- c(r = 0.1, #growth rate of resources
a = 0.01, #feeding rate of primary consumer on resources
f = 0.01, #efficiency of feeding on resources
q = 0.1 #intrinsic death rate of the primary consumer
)
##define the starting abundances
start <- c(R = 100,
N = 10)
##set the length of the simulation (100 time steps)
## and the resolution (step size, in this case 1 time step)
time <- seq(0, 100, 1)
##run the simulation
dd <- lv2_sim(start, parms, time)
##plot it
ggplot(dd$sim, aes(x=time, y=abundance, col=species)) +
geom_line() +
scale_y_log10() +
theme_minimal()
Great! So we have a two species LV model up and running, but our parameter values are…uninspiring to say the least. One thing we might be really interested in is how our model behaves over a range of parameter values, rather than just the single set used above, i.e. we want to set up a set of simulations covering a range of possible parameter values. Let's go about that:
##make a list where the sequence 0.01 to 0.2 is
##replicated once for each of the parameters in parms
parms_list <- rep(list(seq(0.01, 0.2, length.out = 5)),
length(parms))
##use expand.grid() to make a data.frame of every possible
##combination of the 4 parameters
all_var <- expand.grid(parms_list)
Task
Look at the top 6 rows of the data, and calculate the dimensions of the data.frame
##look at the data
head(all_var)
## the dimensions of the data
dim(all_var)
Then we will turn this into a list where each object in the list is 1 row of the data frame:
##convert that into a list, where each object is 1 row of the data frame
all_var_list <- as.list(as.data.frame(t(all_var)))
Ok great, we have created a list with all of the possible combinations of our 4 parameters saved as individual objects - a total of 625 simulations. Now we want to pass each of these combinations of parameters to our model in turn. Because we have a list, we can use the lapply() function to apply a function to a list (durr!). This is a really convenient way of running simulations.
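If you haven't met lapply() before, here is a minimal example - it applies a function to each element of a list and returns a list of the results:

```r
##square each element of a list
lapply(list(1, 2, 3), function(x) x^2)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
```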
##save out the time it takes to run
time_single_core <- system.time({
##the simulations
res <- lapply(all_var_list, function(x, name_vars, start, time){
try({
##add names to the parameters selected from the list
names(x)<-name_vars
##pass the arguments to the function and run the simulation
dd <- lv2_sim(start = start,
parms = x,
time = time)
##return the data we want:
return(dd)
}, silent = TRUE)
##any additional objects we want to pass to lapply
## listed between the last } and the last )
}, name_vars = names(parms),
start = start,
time = time)
})
##how long did it take?
time_single_core
## user system elapsed
## 2.397 0.053 2.451
Ok, so running these simulations took around 2.5 seconds on my computer - remember this is for 5 values of 4 parameters (i.e. a very low number of simulations), and this model is deterministic so we don't need to run multiple realisations! The problem with doing simulations in this way is that, because we are doing a fully factorial set of simulations (i.e. every possible combination), adding in any new parameters/parameter values quickly explodes the number of simulations.
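To see the scale of the problem, the number of fully factorial simulations is simply (number of values per parameter)^(number of parameters):

```r
##number of values tried per parameter
n_values <- c(5, 10, 20)
##number of parameters in the model
n_params <- 4
##total number of simulations in a fully factorial design
n_values^n_params
## [1]    625  10000 160000
```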
The above clearly demonstrates that parallel computing becomes much more important as you get more complex simulations, so let's see if we can speed up our simulations.
The parallel library has some nifty functions to help us here, so first off install the package and load it.
First off it's useful to know how many cores our computers have:
##show the number of cores
n_cores <- detectCores()
n_cores
## [1] 8
We can then use the parallelised version of the lapply() function we used above to run the simulations again.
Unfortunately this is where Windows and Mac differ, so it's a choose your own adventure depending on your operating system…
For macOS:
##save the time out
time_multi_core <- system.time({
##run the simulations
res <- mclapply(all_var_list, function(x, name_vars, start, time){
try({
##add names to the parameters selected from the list
names(x) <- name_vars
##pass the arguments to the function and run the simulation
dd <- lv2_sim(start = start,
parms = x,
time = time)
##return the data we want:
return(dd)
})
##any additional objects we want to pass to lapply
## listed between the last } and the last )
}, name_vars = names(parms),
start = start,
time = time,
## here we can specify the number of cores we want to use
## I always use n.cores-1 so the computer doesn't slow
## down too much
mc.cores = n_cores - 1)
})
And for Windows, you need to do a little extra work. First off, set up a cluster (an object defining your number of cores) and initiate it:
##make the cluster
cl <- makeCluster(n_cores - 1, type = "PSOCK")
##then export any functions we have defined to the cluster so we can use them in the simulations
clusterExport(cl, c("lv2_sim",
"eventFun",
"rootfun",
"lv2"),
envir=environment())
And then we can continue as before, but instead using parLapply():
##save the time out
time_multi_core <- system.time({
##run the simulations
res <- parLapply(all_var_list, function(x, name_vars, start, time){
##NOTE need to load in all the packages to each node on the cluster, so:
library(deSolve)
library(tidyverse)
try({
##add names to the parameters selected from the list
names(x) <- name_vars
##pass the arguments to the function and run the simulation
dd <- lv2_sim(start = start,
parms = x,
time = time)
##return the data we want:
return(dd)
})
##any additional objects we want to pass to lapply
## listed between the last } and the last )
}, name_vars = names(parms),
start = start,
time = time,
## specify which cluster you are running it on, i.e. the one we defined above (cl)
cl = cl)
})
Great, let's compare the times.
Task
Calculate the ratio of the time taken using a single core to that taken using multiple cores, from the times we calculated above.
##divide
time_single_core["elapsed"]/time_multi_core["elapsed"]
## elapsed
## 2.853318
So you can see that the multi core version is around 3 times faster than the single core version.
Let's plot out our results to have a look at what our model has been up to:
##look at the first object of the results
res[[1]]
## $sim
## # A tibble: 134 × 5
## time species abundance sim_length parms
## <deSolve> <chr> <deSolve> <dbl> <chr>
## 1 0 R 100.000000 65.8 r=0.01,a=0.01,f=0.01,q=0.01
## 2 0 N 10.000000 65.8 r=0.01,a=0.01,f=0.01,q=0.01
## 3 1 R 91.394497 65.8 r=0.01,a=0.01,f=0.01,q=0.01
## 4 1 N 9.995633 65.8 r=0.01,a=0.01,f=0.01,q=0.01
## 5 2 R 83.536677 65.8 r=0.01,a=0.01,f=0.01,q=0.01
## 6 2 N 9.983051 65.8 r=0.01,a=0.01,f=0.01,q=0.01
## 7 3 R 76.366937 65.8 r=0.01,a=0.01,f=0.01,q=0.01
## 8 3 N 9.963002 65.8 r=0.01,a=0.01,f=0.01,q=0.01
## 9 4 R 69.828937 65.8 r=0.01,a=0.01,f=0.01,q=0.01
## 10 4 N 9.936186 65.8 r=0.01,a=0.01,f=0.01,q=0.01
## # … with 124 more rows
##
## $parameters
## r a f q
## 0.01 0.01 0.01 0.01
##bind the results into a data frame
extract_res <- rbindlist(lapply(res, function(h){
##extract and return the simulation values
return(h$sim)
}))
##plot the results
ggplot(extract_res, aes(x = time,
y = abundance,
group = parms)) +
geom_line(col = "white", alpha=0.3) +
facet_wrap(~species) +
##log the y axis
scale_y_log10() +
##dark theme
theme_dark()
You can see that for some of the simulations the abundance of one of the two species drops below 1 (which would indicate an extinction of one species in the real world), so let's just plot the simulations which ran until the end of the 100 time steps.
Task
Plot the data as above but only plot those simulations which reach 100 time steps (i.e. recreate the plot below). Hint - look at the results to see how you might be able to filter that data.
##plot only those simulations which run till the end of the
##100 time steps
ggplot(filter(extract_res, sim_length == 100),
aes(x = time,
y = abundance,
group = parms)) +
geom_line(col = "white", alpha=0.4) +
facet_wrap(~species, scales = "free") +
scale_y_log10() +
theme_dark()
## Don't know how to automatically pick scale for object of type deSolve/matrix. Defaulting to continuous.
Bonus task - code a three species Lotka-Volterra model
The three species L-V model will look like this:
\[\displaystyle\frac{dR}{dt} = rR - \alpha N R\] \[\displaystyle\frac{dN}{dt} = f \alpha NR - qN - eNP\] \[\displaystyle\frac{dP}{dt} = zeNP -bP\]
Where R is the number of the basal prey species, N is the number of primary consumers, and P is the number of the top predators. The basal species R grows at rate r, alpha is the search efficiency/attack rate of the primary consumers, f is the conversion efficiency of R to new N, and q is the exponential decline rate of the N in the absence of prey (R). The top predator (P) eats the intermediate predator (N) at a rate defined by the search efficiency/attack rate e of P on N. The population of P then grows in the same way N does, as a function of z (the conversion efficiency of N’s to new P’s), the attack rate (e), and the number of N and P, whilst the population declines exponentially at a rate b.
We used lapply() above to, in effect, loop through a bunch of parameter values and run the model. There is also a parallelised version of for() which we can use to achieve a similar thing - foreach(). foreach() is more complicated to implement than lapply()/mclapply(), but it has its advantages: it is more flexible both in terms of outputs (lapply()/mclapply() return a list, which you then have to work on, e.g. by using data.table::rbindlist()) and in terms of whether the code is run in parallel or not.
The basic form of foreach() is similar to that of the for() loops we used above - we define an object to iterate over, and give each iteration of that object a name (in the below example i) such that at each iteration i is a different value from our object. We then tell foreach() to execute some code, and can specify whether this is carried out in parallel or not. First off, lets look at a non-parallelised version:
##load the foreach
library(foreach)
##an object to iterate over
our_seq <- 1:3
##Not parallelised
foreach(i = our_seq) %do% {
##some calculation
sin(i/i^2)
}
## [[1]]
## [1] 0.841471
##
## [[2]]
## [1] 0.4794255
##
## [[3]]
## [1] 0.3271947
So you can see that the code is easy to understand. How about a parallelised version?
To do this we need to do a bit of preamble to tell foreach() how to deal with the parallelisation. Using makeCluster() we make a series of copies of R running in the background, and set the number to detectCores() - 1.
##load the doSNOW package
library(doSNOW)
##make a cluster, defining the number of cores to use
clus <- makeCluster(detectCores() - 1)
##register the cluster with doSNOW
##you only need to do this once, then you can use it for the rest
##of your script
registerDoSNOW(clus)
##an object to iterate over
our_seq <- 1:30
##then you can run a parallelised version of our foreach() loop
looped_result <- foreach(i = our_seq) %dopar% {
sin(i/i^2)
}
##look at the output
head(looped_result)
## [[1]]
## [1] 0.841471
##
## [[2]]
## [1] 0.4794255
##
## [[3]]
## [1] 0.3271947
##
## [[4]]
## [1] 0.247404
##
## [[5]]
## [1] 0.1986693
##
## [[6]]
## [1] 0.1658961
As we said earlier, one of the advantages of foreach() is its flexibility (see ?foreach). One nice option is the ability to directly output different types of object (rather than just a list, as with mclapply()). For example, below we return a tibble using rbind. We need to load any packages into the environments created by makeCluster(), as these are created empty (i.e. without packages), so we use the .packages argument to load any required packages into the newly created local environments:
## combine into a tibble
## see ?rbind for details (hint - it means row bind)
new_tib <- foreach(i = our_seq,
.combine = 'rbind',
.packages = "tidyverse") %dopar% {
##return a tibble object with sin of i
##and the value of i
return(tibble("sin" = sin(i/i^2),
"i" = i))
}
## look at new_tib
new_tib
## # A tibble: 30 × 2
## sin i
## <dbl> <int>
## 1 0.841 1
## 2 0.479 2
## 3 0.327 3
## 4 0.247 4
## 5 0.199 5
## 6 0.166 6
## 7 0.142 7
## 8 0.125 8
## 9 0.111 9
## 10 0.0998 10
## # … with 20 more rows
The above is particularly useful if you are - say - looping through replicates of a simulation and want to save the replicate number (i) as part of the output, to discern between replicate simulations.
For particularly large/long simulations (minutes, hours, days) we might want to check how far through the simulation is, and we can do this by adding a progress bar via .options.snow:
##make a progress bar, we need to set the maximum to the
##maximum length of the sequence we are iterating across
pb <- txtProgressBar(max = max(our_seq), style = 3)
##   |                                                                      |   0%
##make a function to pass the progress bar to the clusters
progress <- function(n) setTxtProgressBar(pb, n)
##and set the options for snow to include a progress bar
opts <- list(progress = progress)
##load in the options using .options.snow
new_tib <- foreach(i = our_seq,
.combine = 'rbind',
.packages = "tidyverse",
.options.snow = opts) %dopar% {
##return the output
return(data.frame("sin" = sin(i/i^2),
"i" = i))
}
##   |==                                                                    |   3%
##   |===================================                                   |  50%
##   |======================================================================| 100%
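Note that .options.snow is only read when the parallel backend is registered with the doSNOW package (registerDoSNOW()) rather than doParallel. A minimal end-to-end sketch, assuming doSNOW is installed (the cluster size and sequence length here are illustrative):

```r
library(foreach)
library(doSNOW)
##register a doSNOW backend so .options.snow is respected
cl <- makeCluster(2)
registerDoSNOW(cl)
##progress bar spanning the 10 iterations below
pb <- txtProgressBar(max = 10, style = 3)
opts <- list(progress = function(n) setTxtProgressBar(pb, n))
out <- foreach(i = 1:10,
               .combine = 'c',
               .options.snow = opts) %dopar% {
                 sqrt(i)
               }
close(pb)
stopCluster(cl)
```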
There is also a very useful .errorhandling option in foreach() which lets you specify what to do if an iteration generates an error (e.g. "remove" drops that iteration's result, "pass" returns the error object instead of stopping - bear this in mind for later).
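For instance, with .errorhandling = "remove" any iteration that throws an error is simply dropped from the combined result. A quick sketch using the sequential %do% operator for illustration (the deliberate error on iteration 3 is of course made up):

```r
library(foreach)
##iteration 3 deliberately errors; "remove" drops it from the output
res <- foreach(i = 1:4,
               .combine = 'c',
               .errorhandling = "remove") %do% {
                 if (i == 3) stop("deliberate error")
                 i * 10
               }
res
## [1] 10 20 40
```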
Let’s quickly summarise what we have learned today:
| Function/operator | Call |
|---|---|
| Time an operation | system.time() |
| Compare the speed of one or more expressions over replicate runs | microbenchmark() |
| Find the number of cores on your computer | detectCores() |
| Parallelised version of lapply | mclapply() |
| Dark theme for ggplot | theme_dark() |
| Initialise a loop using the foreach package | foreach() |
| Make a progress bar | txtProgressBar() |
The excellent R book “Efficient R programming” is available for free here
Related to the issues of fitting GLMs and GLMMs, your homework was to try to identify the error distributions, link functions, and significant predictor variables for 4 data sets in the Bioinformatics repository.
So, to give you the answers:
| Data | Error distribution | Link function | Significant predictor variables |
|---|---|---|---|
| data 1.csv | Gaussian | inverse | x1, x2 |
| data 2.csv | Gamma | log | x1, x2, x3 |
| data 3.csv | Binomial | logit | x1 |
| data 4.csv | Poisson | log | x1, x2 |
You might have written the right code but not got the answers above - such are the vagaries of analysing data! Even when we know the distributions (and the data are simulated, so not really noisy) it can be hard to get the “right” answer. Below is the code I used to generate these data, so have a play with different numbers of predictor variables and amounts of data to see how much data you need to recover the correct pattern and distribution.
##load GlmSimulatoR
library(GlmSimulatoR)
###generate data with an inverse gaussian distribution
##N is the number of data points,
##the weights are the strength of each predictor variable's effect on the y variable
## see ?simulate_inverse_gaussian for more detail
sim_gaussian_inv <- simulate_inverse_gaussian(N = 150,
unrelated = 2,
weights = c(5, 5, 0, 0))
##set the names of the columns
names(sim_gaussian_inv) <- c("y", paste0("x", seq_len(ncol(sim_gaussian_inv) - 1)))
mod1 <- glm(y ~ x1 + x2 + x3 + x4, data = sim_gaussian_inv, family = gaussian(link = "inverse"))
summary(mod1)
##
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4, family = gaussian(link = "inverse"),
## data = sim_gaussian_inv)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.110341 -0.050723 -0.008114 0.038522 0.221212
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1012 1.0444 3.927 0.000133 ***
## x1 0.5794 0.3611 1.605 0.110727
## x2 0.3289 0.3361 0.979 0.329387
## x3 -0.2142 0.3548 -0.604 0.546894
## x4 -0.4874 0.3380 -1.442 0.151450
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.003980511)
##
## Null deviance: 0.59829 on 149 degrees of freedom
## Residual deviance: 0.57717 on 145 degrees of freedom
## AIC: -396.36
##
## Number of Fisher Scoring iterations: 6
##other possible error distributions to simulate:
##simulate_gamma()
##simulate_binomial()
##simulate_poisson()
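If you want to see the same idea without GlmSimulatoR, you can simulate GLM data directly in base R. Here is a sketch (all the numbers - sample size, intercept, slope, shape - are illustrative) that draws y from a Gamma distribution whose mean depends on x1 through a log link, then checks that the matching GLM recovers the slope:

```r
set.seed(42)
n <- 500
x1 <- runif(n)
##log link: log(mean of y) = 1 + 0.5 * x1
mu <- exp(1 + 0.5 * x1)
##Gamma with mean mu (mean of rgamma is shape / rate)
shape <- 10
y <- rgamma(n, shape = shape, rate = shape / mu)
sim <- data.frame(y = y, x1 = x1)
##fit the matching GLM; the x1 estimate should be close to 0.5
mod <- glm(y ~ x1, data = sim, family = Gamma(link = "log"))
coef(mod)
```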
Brilliant, hopefully you all found that helpful and managed to use the tools we learnt last week to detect the correct error distributions. This week, we will be talking about…
This week’s homework is to go back through the 6 workbooks you have done so far, make sure everything makes sense to you, and brush up on any bits you feel weak on. This revision session is intended to prep you for the upcoming coursework…